fix(bots): skip bot upsert when nothing changed to stop team-strip + reindex loop on boot #28128
joaopamaral wants to merge 2 commits into
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
f5365ec to 71c6156
Addressed all four bot review comments in
71c6156 to c85a520
Pull request overview
This PR prevents a boot-time loop where bot users are upserted on every restart, which (because the in-memory bot User lacks teams) triggers updateTeams(...) to remove existing bot team memberships and causes repeated version bumps + search reindex storms. The fix adds a no-op short-circuit in UserUtil.addOrUpdateBotUser(...), and adds unit tests to ensure the short-circuit behavior.
Changes:
- Add a short-circuit in `UserUtil.addOrUpdateBotUser` to skip `createOrUpdate` when key bot fields are unchanged.
- Ensure `retrieveWithAuthMechanism` also loads `roles` so the short-circuit can compare them.
- Add Mockito-based unit tests covering both the no-op path and the upsert path.
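
The short-circuit listed above can be sketched in isolation. This is a minimal illustration, not the PR's actual code: `BotUser`, `isUnchanged`, and `addOrUpdate` are hypothetical stand-ins, roles are modeled as plain strings, and `listOrEmpty` is re-implemented locally on the assumption that it maps `null` to an empty list, as the PR description states.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.UnaryOperator;

final class BotUpsertGuardSketch {
  // Hypothetical, simplified stand-in for the real User entity.
  record BotUser(String name, String description, String displayName, List<String> roles) {}

  // Local stand-in for the listOrEmpty helper named in the PR: null becomes an empty list.
  static <T> List<T> listOrEmpty(List<T> list) {
    return list == null ? List.of() : list;
  }

  // True when a stored row exists and the three compared fields all match.
  // Note: List equality is order-sensitive, so reordered roles count as a change here.
  static boolean isUnchanged(BotUser stored, BotUser incoming) {
    return stored != null
        && Objects.equals(stored.description(), incoming.description())
        && Objects.equals(stored.displayName(), incoming.displayName())
        && Objects.equals(listOrEmpty(stored.roles()), listOrEmpty(incoming.roles()));
  }

  // Returns the stored user untouched when nothing changed; otherwise delegates to the upsert.
  static BotUser addOrUpdate(BotUser stored, BotUser incoming, UnaryOperator<BotUser> upsert) {
    return isUnchanged(stored, incoming) ? stored : upsert.apply(incoming);
  }

  public static void main(String[] args) {
    // The database-loaded row carries an empty roles list; the freshly built one carries null.
    BotUser stored = new BotUser("ingestion-bot", "desc", "Ingestion Bot", List.of());
    BotUser same = new BotUser("ingestion-bot", "desc", "Ingestion Bot", null);
    BotUser renamed = new BotUser("ingestion-bot", "desc", "Ingestion Bot v2", null);

    int[] upsertCalls = {0};
    UnaryOperator<BotUser> upsert = u -> { upsertCalls[0]++; return u; };

    addOrUpdate(stored, same, upsert);     // guard fires, upsert skipped
    addOrUpdate(stored, renamed, upsert);  // displayName differs, upsert runs
    addOrUpdate(null, same, upsert);       // no stored row (first boot), upsert runs

    System.out.println(upsertCalls[0]); // prints 2
  }
}
```

Passing the upsert as a function keeps the sketch free of repository types; in the real code the equivalent call is `userRepository.createOrUpdate`.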
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java | Adds the short-circuit guard and expands the fields loaded for bot comparison. |
| openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest.java | Adds unit tests validating short-circuit vs. upsert behavior. |
The Java checkstyle failed. Please run You can install the pre-commit hooks with |
🟡 Playwright Results: all passed (14 flaky)
✅ 4056 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 103 skipped
🟡 14 flaky test(s) (passed on retry)
How to debug locally:

    # Download playwright-test-results-<shard> artifact and unzip
    npx playwright show-trace path/to/trace.zip # view trace

Checking failing tests 👀
…reindex loop on boot

`BotResource.initialize()` runs `UserUtil.addOrUpdateBotUser(user)` for every bot on every OM boot. The in-memory `User` built by `UserUtil.user(...)` does not have the `teams` field populated, so the PUT path through `userRepository.createOrUpdate -> UserUpdater.entitySpecificUpdate` runs `updateTeams(original, updated)` with `original.teams = [Organization]` (or the bot's real stored teams) and `updated.teams = null`. `updateTeams` then executes `deleteTo(user, HAS, TEAM) + assignTeams(null)`, which strips every stored team membership the bot had, bumps the user version, and triggers an Elasticsearch reindex of both the user and each affected team. With several bots this fires on every restart and produces the reindex storm plus "Circular dependency detected in team hierarchy for team: Organization" warnings in the boot logs. In one production deployment this added almost 3 minutes to every boot.

Short-circuit when the incoming bot has no real change vs. the persisted row: compare `description`, `displayName`, and `roles`. If they all match, return the original user and skip the PUT entirely: no `UserUpdater`, no team strip, no version bump, no reindex.

Two adjustments to make the guard actually fire:
- `retrieveWithAuthMechanism` now also loads `"roles"` (was loading only `"authenticationMechanism"`); `description` and `displayName` are scalar JSON-column fields and were already populated by the base read.
- Compare `roles` via `listOrEmpty(...)` on both sides because the database-loaded original returns an empty list while the freshly built in-memory user returns null, and `Objects.equals(null, [])` is false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2b464bb to 652e6e8
// Short-circuit when the incoming bot user has no real change vs. what's already in the
// database. Without this guard every OM boot calls into addOrUpdateUser ->
// userRepository.createOrUpdate, and UserUpdater.entitySpecificUpdate then runs
// updateTeams/updateRoles/etc. with the incoming `user.getTeams() == null`, which strips
// the bot's stored team relationships, bumps the version, and triggers an Elasticsearch
// reindex of the bot user (and any team membership change ripples into the team_search
// index too). With many bots this is a reindex storm on every restart.
💡 Quality: Overly verbose inline comment block in production code
The 7-line comment block (lines 335–341) explains the why behind the entire PR rather than documenting non-obvious behaviour at the call site. The PR description already captures the rationale. Per the project's "write self-documenting code; only document complex business logic or workarounds" guideline, this would be cleaner as a 1–2 line comment (e.g., // Skip upsert when nothing changed — avoids stripping teams/triggering reindex on boot). The rest is commit-message / PR-description material.
Code Review 👍 Approved with suggestions · 5 resolved / 6 findings
Prevents redundant bot upserts during boot by adding a short-circuit comparison for bot attributes, successfully resolving the team-strip and reindexing loop. Please consider condensing the verbose inline comment block in the production logic.
💡 Quality: Overly verbose inline comment block in production code
📄 openmetadata-service/src/main/java/org/openmetadata/service/util/UserUtil.java:335-341
✅ 5 resolved
✅ Quality: Fully qualified class names used instead of imports
✅ Quality: Swallowed RuntimeException uses flow-control exception pattern
✅ Quality: Unused private helper method
Summary
`BotResource.initialize()` runs `UserUtil.addOrUpdateBotUser(user)` for every bot on every OM boot. The in-memory `User` built by `UserUtil.user(...)` does not have the `teams` field populated, so the PUT path through `userRepository.createOrUpdate -> UserUpdater.entitySpecificUpdate` runs `updateTeams(original, updated)` with `original.teams = [Organization]` (or the bot's real stored teams) and `updated.teams = null`. `updateTeams` then executes `deleteTo(user, HAS, TEAM) + assignTeams(null)`, which strips every stored team membership the bot had, bumps the user version, and triggers an Elasticsearch reindex of both the user and each affected team. With several bots this fires on every restart and produces the reindex storm we observed in production logs.

The pattern repeated for `profiler-bot`, `governance-bot`, `usage-bot`, `ingestion-bot`, … each one taking ~100–200 ms plus a team-side reindex. We also saw `Circular dependency detected in team hierarchy for team: Organization. Skipping to prevent StackOverflowError.` from `SubjectContext` during the same window: the boot-time team churn was tripping the cycle guard.

In a real environment with several bots this added ~3 minutes to every boot.
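
The delete-then-assign shape described above can be illustrated with a toy model. This is a sketch under loud assumptions: the in-memory map, the `versionBumps` counter, and this `updateTeams` signature are all hypothetical and only mimic the behavior the PR describes; the real `UserUpdater` works against relationship storage and a search index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TeamStripSketch {
  // Hypothetical stand-in for stored user -> team relationships.
  final Map<String, List<String>> teamsByUser = new HashMap<>();
  // Counts detected changes; in the real system each one means a version bump plus reindex.
  int versionBumps = 0;

  // Mimics the described update: drop every stored team, then assign whatever the update carries.
  void updateTeams(String user, List<String> updatedTeams) {
    List<String> original = teamsByUser.getOrDefault(user, List.of());
    List<String> updated = updatedTeams == null ? List.of() : updatedTeams;

    teamsByUser.remove(user); // analogous to deleteTo(user, HAS, TEAM)
    if (!updated.isEmpty()) {
      teamsByUser.put(user, new ArrayList<>(updated)); // analogous to assignTeams(updated)
    }
    if (!original.equals(updated)) {
      versionBumps++; // a field change was recorded
    }
  }

  public static void main(String[] args) {
    TeamStripSketch store = new TeamStripSketch();
    store.teamsByUser.put("ingestion-bot", List.of("Organization"));

    // The boot-time upsert passes a user whose teams were never populated (null).
    store.updateTeams("ingestion-bot", null);

    System.out.println(store.teamsByUser.containsKey("ingestion-bot")); // prints false
    System.out.println(store.versionBumps); // prints 1
  }
}
```

The point of the model is that a `null` team list is indistinguishable from "remove all teams" once the update path runs, which is why skipping the update entirely is the safer fix for unchanged bots.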
Fix
Short-circuit `addOrUpdateBotUser` when the incoming bot has no real change vs. the persisted row: compare `description`, `displayName`, and `roles`. If they all match, return the original user and skip the PUT entirely: no `UserUpdater`, no team strip, no version bump, no reindex.

Two small adjustments to make the guard actually fire:
- `retrieveWithAuthMechanism` now also loads `"roles"` (was loading only `"authenticationMechanism"`). `description` and `displayName` are scalar JSON-column fields and were already populated by the base read.
- Compare `roles` via `listOrEmpty(...)` on both sides because the database-loaded original returns an empty list while the freshly built in-memory user returns `null`, and `Objects.equals(null, [])` is `false`.

The first call still hits the existing code path (no `originalUser` -> guard skipped), so seeding new bots is unchanged.

Reproducer
- … (`ingestion-bot`, `profiler-bot`, `governance-bot`, `usage-bot`, …) is upserted by `BotResource.initialize()`.
- … `Organization`) team.
- … `Organization`. Boot logs show `fieldsDeleted=[name=teams, oldValue=[...], newValue=null]` for every bot, each followed by user + team ES reindex log lines.
- … `fieldsDeleted=[teams]` log lines appear. Boot time drops accordingly (~3 minutes saved in production).

Test plan
Added `openmetadata-service/src/test/java/org/openmetadata/service/util/UserUtilBotTest` with two Mockito unit tests:
- `addOrUpdateBotUserShortCircuitsWhenNothingChanged`: mocks `UserRepository`, stubs `getByName` to return a stored bot whose `roles`/`description`/`displayName` match the incoming user, calls `addOrUpdateBotUser` (boundary), asserts the return is the same instance as the stored user, and verifies `userRepository.createOrUpdate(...)` was never called. Verified to fail before the fix (test reaches `UserUtil.addOrUpdateUser` and explodes on the unstubbed `createOrUpdate`); passes after the fix.
- `addOrUpdateBotUserGoesThroughUpsertWhenDisplayNameChanged`: same setup but with mismatching `displayName`; verifies `userRepository.createOrUpdate(...)` is called, so we don't accidentally short-circuit on real changes.

Local manual verification: spun the fix into the 1.12.7 backport branch we run in production, restarted, observed no `fieldsDeleted=[teams]` log entries for the bot users and no follow-up bot/team reindex log lines. Boot duration dropped by ~3 minutes.

Opening as draft for maintainer feedback on:
- … `roles, description, displayName`) is acceptable or you'd prefer a broader/narrower field check;

🤖 Generated with Claude Code